14 research outputs found

    Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications

    With the increasing importance of safety requirements associated with the use of black box models, evaluating the selective answering capability of models has become critical. Area under the curve (AUC) is used as a metric for this purpose. We find limitations in AUC; e.g., a model with a higher AUC is not always better at selective answering. We propose three alternative metrics that fix the identified limitations. In experiments with ten models, our results using the new metrics show that newer and larger pre-trained models do not necessarily perform better at selective answering. We hope our insights will help develop better models tailored for safety-critical applications.
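The abstract above does not say which curve the AUC is taken over. As a minimal sketch, assuming an accuracy-versus-coverage curve in which the model answers its most confident questions first, the metric could be computed as follows. The function names and the sample data are hypothetical and are not taken from the paper.

```python
def accuracy_coverage_curve(confidences, correct):
    """Return (coverage, accuracy) points, answering the most confident items first.

    Assumption: selective answering is modeled by a confidence threshold, so
    lowering the threshold adds one more answered item at a time.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    points = []
    right = 0
    for answered, i in enumerate(order, start=1):
        right += 1 if correct[i] else 0
        points.append((answered / len(order), right / answered))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical model outputs: confidence per question and whether it was right.
confidences = [0.9, 0.8, 0.6, 0.4]
correct = [True, True, False, True]

curve = accuracy_coverage_curve(confidences, correct)
auc = area_under_curve(curve)
```

Because the area collapses the whole curve into one number, two models with very different behavior at high-confidence (low-coverage) operating points can receive similar scores, which is one way a single AUC value could mislead.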

    PMU Tracker: A Visualization Platform for Epicentric Event Propagation Analysis in the Power Grid

    The electrical power grid is a critical infrastructure, with disruptions in transmission having severe repercussions on daily activities, across multiple sectors. To identify, prevent, and mitigate such events, power grids are being refurbished as 'smart' systems that include the widespread deployment of GPS-enabled phasor measurement units (PMUs). PMUs provide fast, precise, and time-synchronized measurements of voltage and current, enabling real-time wide-area monitoring and control. However, the potential benefits of PMUs, for analyzing grid events like abnormal power oscillations and load fluctuations, are hindered by the fact that these sensors produce large, concurrent volumes of noisy data. In this paper, we describe working with power grid engineers to investigate how this problem can be addressed from a visual analytics perspective. As a result, we have developed PMU Tracker, an event localization tool that supports power grid operators in visually analyzing and identifying power grid events and tracking their propagation through the power grid's network. As a part of the PMU Tracker interface, we develop a novel visualization technique which we term an epicentric cluster dendrogram, which allows operators to analyze the effects of an event as it propagates outwards from a source location. We robustly validate PMU Tracker with: (1) a usage scenario demonstrating how PMU Tracker can be used to analyze anomalous grid events, and (2) case studies with power grid operators using a real-world interconnection dataset. Our results indicate that PMU Tracker effectively supports the analysis of power grid events; we also demonstrate and discuss how PMU Tracker's visual analytics approach can be generalized to other domains composed of time-varying networks with epicentric event characteristics.
    Comment: 10 pages, 5 figures, IEEE VIS 2022 paper to appear in IEEE TVCG; conference encourages arXiv submission for accessibility

    Image or Information? Examining the Nature and Impact of Visualization Perceptual Classification

    How do people internalize visualizations: as images or information? In this study, we investigate the nature of internalization for visualizations (i.e., how the mind encodes visualizations in memory) and how memory encoding affects its retrieval. This exploratory work examines the influence of various design elements on a user's perception of a chart. Specifically, which design elements lead to perceptions of visualization as an image or as information? Understanding how design elements contribute to viewers perceiving a visualization more as an image or information will help designers decide which elements to include to achieve their communication goals. For this study, we annotated 500 visualizations and analyzed the responses of 250 online participants, who rated the visualizations on a bilinear scale as image or information. We then conducted an in-person study (n = 101) using a free recall task to examine how the image/information ratings and design elements impact memory. The results revealed several interesting findings: Image-rated visualizations were perceived as more aesthetically appealing, enjoyable, and pleasing. Information-rated visualizations were perceived as less difficult to understand and more aesthetically likable and nice, though participants expressed higher positive sentiment when viewing image-rated visualizations and felt less guided to a conclusion. We also found different patterns among older participants. Importantly, we show that visualizations internalized as images are less effective in conveying trends and messages, though they elicit a more positive emotional judgment, while informative visualizations exhibit annotation-focused recall and elicit a more positive design judgment. We discuss the implications of this dissociation between aesthetic pleasure and perceived ease of use in visualization design.
    Comment: 11 pages, 10 figures, 3 tables, accepted at IEEE Vis 202

    LINGO : Visually Debiasing Natural Language Instructions to Support Task Diversity

    Cross-task generalization is a significant outcome that defines mastery in natural language understanding. Humans show a remarkable aptitude for this, and can solve many different types of tasks, given definitions in the form of textual instructions and a small set of examples. Recent work with pre-trained language models mimics this learning style: users can define and exemplify a task for the model to attempt as a series of natural language prompts or instructions. While prompting approaches have led to higher cross-task generalization compared to traditional supervised learning, analyzing 'bias' in the task instructions given to the model is a difficult problem, and has thus been relatively unexplored. For instance, are we truly modeling a task, or are we modeling a user's instructions? To help investigate this, we develop LINGO, a novel visual analytics interface that supports an effective, task-driven workflow to (1) help identify bias in natural language task instructions, (2) alter (or create) task instructions to reduce bias, and (3) evaluate pre-trained model performance on debiased task instructions. To robustly evaluate LINGO, we conduct a user study with both novice and expert instruction creators, over a dataset of 1,616 linguistic tasks and their natural language instructions, spanning 55 different languages. For both user groups, LINGO promotes the creation of more difficult tasks for pre-trained models, that contain higher linguistic diversity and lower instruction bias. We additionally discuss how the insights learned in developing and evaluating LINGO can aid in the design of future dashboards that aim to minimize the effort involved in prompt creation across multiple domains.
    Comment: 13 pages, 6 figures, Eurovis 202

    PromptAid: Prompt Exploration, Perturbation, Testing and Iteration using Visual Analytics for Large Language Models

    Large Language Models (LLMs) have gained widespread popularity due to their ability to perform ad-hoc Natural Language Processing (NLP) tasks with a simple natural language prompt. Part of the appeal of LLMs is their approachability to the general public, including individuals with no prior technical experience in NLP techniques. However, natural language prompts can vary significantly in terms of their linguistic structure, context, and other semantics. Modifying one or more of these aspects can result in significant differences in task performance. Non-expert users may find it challenging to identify the changes needed to improve a prompt, especially when they lack domain-specific knowledge and appropriate feedback. To address this challenge, we present PromptAid, a visual analytics system designed to interactively create, refine, and test prompts through exploration, perturbation, testing, and iteration. PromptAid uses multiple, coordinated visualizations which allow users to improve prompts using three strategies: keyword perturbations, paraphrasing perturbations, and obtaining the best set of in-context few-shot examples. PromptAid was designed through an iterative prototyping process involving NLP experts and was evaluated through quantitative and qualitative assessments for LLMs. Our findings indicate that PromptAid helps users to iterate over prompt template alterations with less cognitive overhead, generate diverse prompts with the help of recommendations, and analyze the performance of the generated prompts while surpassing existing state-of-the-art prompting interfaces in performance.

    How Robust are Model Rankings : A Leaderboard Customization Approach for Equitable Evaluation

    Models that top leaderboards often perform unsatisfactorily when deployed in real world applications; this has necessitated rigorous and expensive pre-deployment model testing. A hitherto unexplored facet of model performance is: Are our leaderboards doing equitable evaluation? In this paper, we introduce a task-agnostic method to probe leaderboards by weighting samples based on their 'difficulty' level. We find that leaderboards can be adversarially attacked and top performing models may not always be the best models. We subsequently propose alternate evaluation metrics. Our experiments on 10 models show changes in model ranking and an overall reduction in previously reported performance, thus rectifying the overestimation of AI systems' capabilities. Inspired by behavioral testing principles, we further develop a prototype of a visual analytics tool that enables leaderboard revamping through customization, based on an end user's focus area. This helps users analyze models' strengths and weaknesses, and guides them in the selection of a model best suited for their application scenario. In a user study, members of various commercial product development teams, covering 5 focus areas, find that our prototype reduces pre-deployment development and testing effort by 41% on average.
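The abstract above mentions weighting samples by difficulty but does not give the weighting scheme. As a minimal sketch, assuming each sample carries a non-negative difficulty weight and that a model's score is its weight-normalized accuracy, a leaderboard re-ranking could look like this. The function names, model names, and numbers are hypothetical and are not results from the paper.

```python
def weighted_accuracy(correct, weights):
    """Accuracy in which each sample counts in proportion to its weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for c, w in zip(correct, weights) if c) / total


def rank_models(results, weights):
    """Return model names sorted from best to worst weighted accuracy."""
    scores = {name: weighted_accuracy(c, weights) for name, c in results.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-sample correctness for two models on four samples.
results = {
    "model_a": [True, True, True, False],   # misses only the hard sample
    "model_b": [False, True, False, True],  # solves the hard sample
}

uniform = [1, 1, 1, 1]     # ordinary accuracy
difficulty = [1, 1, 1, 3]  # assumed weights: the last sample is the hard one

plain_ranking = rank_models(results, uniform)
weighted_ranking = rank_models(results, difficulty)
```

With uniform weights `model_a` ranks first, while the assumed difficulty weights move `model_b` ahead, illustrating how a ranking can change once hard samples count for more.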

    Localized ridge defect augmentation using human pericardium membrane and demineralized bone matrix

    Background: The patient wanted to restore her lost teeth with implants in the lower left first molar and second premolar region. Cone beam computerized tomography (CBCT) revealed inadequate bone width and height around the future implant sites. The extraction socket in the second premolar area showed inadequate healing with sparse bone fill 4 months after extraction. Aim: To evaluate the clinical feasibility of using a collagen physical resorbable barrier made of human pericardium (HP) to augment localized alveolar ridge defects for the subsequent placement of dental implants. Materials and Methods: Ridge augmentation was done in the compromised area using Puros® demineralized bone matrix (DBM) Putty with chips and an HP allograft membrane. Horizontal (width) and vertical hard tissue measurements with CBCT were recorded on the day of ridge augmentation surgery and at the 4-month and 7-month follow-ups. An intraoral periapical radiograph taken 1 year after implant installation showed minimal crestal bone loss. Results: Guided bone regeneration achieved a bone gain of 4.8 mm horizontally (width) and 6.8 mm vertically in the deficient ridge within 7 months of the procedure. Conclusion and Clinical Implications: The results suggest that an HP allograft membrane may be a suitable component for augmentation of localized alveolar ridge defects in conjunction with DBM with bone chips.